Neural networks reborn

Renewed interest in 2006: ["A fast learning algorithm for deep belief nets", Hinton et al. 2006]
Proposes a way to train deep neural nets:

• Train the first layer.
• Add a layer on top of it and train only this layer.
• Repeat the process until the network is deep enough.
• Use this network as a warm start to train the whole network.

Technical reasons for this growing interest:
• Larger datasets
• More powerful computers
• Small number of algorithmic changes
  ◦ MSE replaced by cross-entropy
  ◦ ReLU (Fukushima, 1975, 1980)

Using classical networks for images?

No, for two reasons:

• They do not take into account the spatial organization of pixels (if the same permutation is
  applied to the pixels of every image, a fully connected network performs the same, whereas
  the images change drastically)
• They are not robust to image shifting

Idea:
• Apply a local transformation to a set of nearby pixels (the spatial nature of the image is used)
• Repeat this transformation over the whole image (resulting in a shift-invariant output)

Not a new idea: it traces back to the perceptron and to studies of the visual cortex of the cat.
The cat is able to

• detect oriented edges, end-points, corners (low-level features)
• combine them to detect more complex geometrical forms (high-level features)

Outline

• Foundations of CNN
  ◦ Convolution layer
  ◦ Pooling layer
  ◦ Data preprocessing

Convolutional neural networks (CNNs)

• Neural networks that use convolution instead of a matrix product in at least one of their layers
• A CNN layer typically includes 3 operations: convolution, activation and pooling
• They use the more general idea of parameter sharing, instead of full connection
  (convolution instead of matrix product)

The convolution operator in neural networks is defined as

$$O(i, j) = (I * K)(i, j) = \sum_{k} \sum_{l} I(i + k, j + l)\, K(k, l)$$

• I is the input and K is called the kernel
• The kernel K is learned (it replaces the weights W of a fully connected layer)
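
The operator can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the sum runs over I(i + k, j + l) K(k, l) with no padding and a stride of one; the function name `conv2d`, the toy image and the hand-picked kernel are illustrative choices, not part of the course material.

```python
import numpy as np

def conv2d(image, kernel):
    """O(i, j) = sum_{k, l} I(i + k, j + l) K(k, l), no padding, stride 1."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the kernel with the patch starting at (i, j)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(64, dtype=float).reshape(8, 8)   # toy 8 x 8 input
kernel = np.array([[1.0, 0.0, -1.0]] * 3)          # hand-picked 3 x 3 filter (not learned)
output = conv2d(image, kernel)                     # shape (6, 6)
```

In a CNN the entries of `kernel` would be learned by gradient descent rather than fixed by hand.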

Convolution - Black and White images

• Size of the input image is 8 x 8 x 1 (height, width, depth)
• Size of the kernel is 3 x 3 x 1

Convolution - RGB

• Size of the input image is 8 x 8 x 3 (height, width, depth)
• Size of the kernel is 3 x 3 x 3

Warning: every filter is small spatially (along width and height), but extends through the
full depth of the input volume.
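
The warning can be made concrete with a short sketch: the kernel slides along height and width only, and each output value sums over all three channels at once. The random image and kernel below are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))     # height, width, depth (RGB)
kernel = rng.normal(size=(3, 3, 3))    # small spatially, full depth

# Slide the kernel over height and width only; it always covers all 3 channels.
out = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3, :] * kernel)
```

One 3 x 3 x 3 kernel therefore produces a single 6 x 6 feature map, not three.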

Parameters of convolutional layer

Four hyperparameters control the size of the output volume: the kernel size, the depth of
the output volume, the stride and the zero-padding.

• The size of the kernel (typically 3 x 3, 5 x 5).
• The depth of the output volume, i.e., the number of filters/activation maps/feature maps.
• The stride, i.e., by how many pixels we move the filter horizontally and vertically.
  Usually, the stride is equal to one (rarely to two, and even more rarely larger).
• The size of the zero-padding, i.e., the number of zeros we add to the borders of the
  image. This can be used to keep the image size constant between the input and the
  output.

How to choose zero-padding?

Let
• I the height/width of the input
• O the height/width of the output
• P the size of the zero-padding
• K the height/width of the filter
• S the stride

What is the relation between these quantities? How do we choose the zero-padding to
obtain an output of the same size as the input?
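
One way to answer, given here as a sketch rather than as the course's own solution: with P zeros added on each side, the standard relation is O = (I - K + 2P)/S + 1 (rounded down when the division is not exact), so for S = 1 and an odd K the choice P = (K - 1)/2 gives O = I. The function names are illustrative.

```python
def conv_output_size(I, K, P, S):
    """Output height/width: O = floor((I - K + 2P) / S) + 1."""
    return (I - K + 2 * P) // S + 1

def same_padding(K):
    """Zero-padding that keeps O = I when S = 1 and K is odd."""
    return (K - 1) // 2

no_padding = conv_output_size(8, 3, 0, 1)                   # 8 x 8 input, 3 x 3 kernel
with_padding = conv_output_size(8, 3, same_padding(3), 1)   # same size as the input
```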

Why convolution?

• The same transformation is applied to all parts of the image (this takes into account the
  spatial dependence between pixels and gives object-shift invariance)
• The input image contains millions of pixel values, but we want to detect small
  meaningful features such as edges, with kernels that cover only a few hundred pixels
• When using a matrix product, all input and output units are connected, whereas
  convolution connects each output neuron to only a few pixels of the input image.

Convolution involves weight sharing (a form of regularization) and requires fewer
parameters, which reduces memory usage, is more statistically efficient and is
computationally faster.
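
The saving in parameters can be illustrated on the 8 x 8 example used earlier. The count below is a sketch that ignores biases; the helper names are illustrative.

```python
def dense_weights(n_in, n_out):
    """Fully connected layer: one weight per (input, output) pair."""
    return n_in * n_out

def conv_weights(kernel_h, kernel_w, depth_in, n_filters):
    """Convolutional layer: the same kernel is shared across all positions."""
    return kernel_h * kernel_w * depth_in * n_filters

# Mapping an 8 x 8 x 1 image to a 6 x 6 x 1 output
dense = dense_weights(8 * 8, 6 * 6)
conv = conv_weights(3, 3, 1, 1)
```

Here `dense` is 2304 while `conv` is 9, and the gap widens quickly as the image grows.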

Sparse connections

• Left: when using matrix multiplication, all outputs are connected to all inputs. We
  say that the connectivity is dense
• Right: in a convolution with a kernel of width 3, only three outputs are affected by
  the input x. We say that the connectivity is sparse

Pooling

The pooling layer operates independently on every depth slice of the input and resizes it
spatially, using the max function.

Parameters:
• Stride S = 2
• Spatial extent F = 2

Usually, S = F = 2 and more rarely F = 3, S = 2 (overlapping pooling).
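
A minimal NumPy sketch of max pooling with these two parameters, assuming an input of shape (height, width, depth) and no padding; the function name `max_pool` is illustrative.

```python
import numpy as np

def max_pool(x, F=2, S=2):
    """Max pooling applied to each depth slice of x (shape H x W x D) independently."""
    H, W, D = x.shape
    out_h, out_w = (H - F) // S + 1, (W - F) // S + 1
    out = np.empty((out_h, out_w, D), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * S:i * S + F, j * S:j * S + F, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max over the spatial window only
    return out

x = np.arange(16.0).reshape(4, 4, 1)
pooled = max_pool(x)                  # S = F = 2: non-overlapping windows
overlapping = max_pool(np.zeros((7, 7, 3)), F=3, S=2)
```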

• Pooling layers compute each pixel of the output as a summary statistic of the
  neighboring input pixels at the corresponding location.
• The most widely used is the max aggregation, called max-pooling
• Pooling helps the representation to become approximately invariant to small
  translations of the input
• If a small translation is applied, the output of the layer is almost unchanged
• Very useful if we care more about the presence of some feature than about its position in
  the image: for face detection, the presence of eyes is more important than where they
  are
• Pooling also makes it possible to handle inputs of different sizes: pictures can have
  different sizes, but the output classification layer must be of fixed size

A possible architecture of a CNN

Consider a grayscale image. Each kernel of the first layer produces one feature map.

The pooling layer operates on each feature map separately.

A convolutional layer operates on the feature maps output by the pooling layer. Each
kernel is a volume whose depth equals the depth of the input volume.

At the end of the network, the feature maps are flattened in order to apply a classic
neural network.

The full architecture is summarized below.

Data processing

Normalizing data

For each channel R, G, B, compute the mean pixel value over all images of the whole data set.
Subtract this value from the corresponding channel of each image. → You do not lose relative
information between images.

Data augmentation

1. Sampling ["Imagenet large scale visual recognition challenge", Russakovsky et al. 2015]
2. Translation/shifting ["Deep convolutional neural networks and data augmentation for environmental sound
   classification", Salamon and Bello 2017]
3. Horizontal reflection/mirroring ["Mirror, mirror on the wall, tell me, is the error small?", H. Yang and Patras
   2015]
4. Rotating ["Holistically-nested edge detection", Xie and Tu 2015]
5. Various photometric transformations ["Predicting depth, surface normals and semantic labels with a
   common multi-scale convolutional architecture", Eigen and Fergus 2015]

Prediction

At test time, patches are extracted from each new image together with some of their
reflections/translations/... A prediction is made for each of these artificial images and the
predictions are aggregated to make the final prediction.
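
The normalization and the test-time aggregation can be sketched as follows. This is an illustration under assumptions: the data set is a random array of shape (N, H, W, 3), only the horizontal reflection is used as augmentation, predictions are aggregated by a plain average, and `model` stands for any hypothetical function mapping an image to a prediction vector.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(10, 8, 8, 3))   # toy data set: N x H x W x RGB

# One mean per channel, computed over all pixels of all images.
channel_mean = images.mean(axis=(0, 1, 2))          # shape (3,)
normalized = images - channel_mean                  # broadcast over N, H, W

# Horizontal reflection: reverse the width axis.
mirrored = normalized[:, :, ::-1, :]

def predict_with_augmentation(model, image):
    """Average the predictions of `model` over an image and its mirror image."""
    views = [image, image[:, ::-1, :]]
    return np.mean([model(v) for v in views], axis=0)
```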

Adding noise - Data augmentation and regularization

• Add noise to input
  ["Training with noise is equivalent to Tikhonov regularization", Bishop 1995]
  ["Adding noise to the input of a model trained with a regularized objective", Rifai et al. 2011]
  ["Explaining and harnessing adversarial examples", Goodfellow et al. 2014]
• Add noise to weights
  ["An analysis of noise in recurrent neural networks: convergence and generalization", Jim et al. 1996]
  ["Practical variational inference for neural networks", Graves 2011]
• Add noise to output
  ["Randomizing outputs to increase prediction accuracy", Breiman 2000]
• Select the best data transformations (computationally expensive, many re-training
  steps).
  ["Transformation pursuit for image classification", Paulin et al. 2014]

Outline

• Famous CNN
  ◦ LeNet (1998)
  ◦ AlexNet (2012)
  ◦ ZFNet (2013)
  ◦ VGGNet (2014)
  ◦ GoogLeNet (2014)
  ◦ ResNet (2016)
  ◦ DenseNet (2017)
  ◦ Many other CNN

LeNet

["Generalization and network design strategies", LeCun et al. 1989]
["Gradient-based learning applied to document recognition", LeCun et al. 1998]

First layer: convolutional layer C1
• Kernel size = 5 x 5 + a bias
• Stride = 1 (overlapping contiguous receptive fields)
• Zero-padding = 0
• Output: 6 different feature maps, each one resulting from the convolution with a
  5 x 5 kernel, to which the activation function σ is applied.

Second layer: subsampling/pooling layer S2
• Type of pooling: averaging.
• Kernel size = 2 x 2
• Stride = 2 (non-overlapping receptive fields)
• Zero-padding = 0
• Output: one feature map per input feature map, resulting from the operation
  σ(w · (2 x 2 average) + b), where w and b are a trainable coefficient and a trainable bias.


Third layer: convolutional layer C3
• Warning: this layer operates on several feature maps, whereas layer C1 operates on
  the input image (depth = 1).
• Here each feature map is connected to some specific input feature maps in order to
  ◦ reduce the number of connections,
  ◦ break the symmetry in the network.

Rows: input feature maps (output of S2). Columns: feature maps of C3. An X marks a connection.

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0   X  .  .  .  X  X  X  .  .  X  X  X  X  .  X  X
1   X  X  .  .  .  X  X  X  .  .  X  X  X  X  .  X
2   X  X  X  .  .  .  X  X  X  .  .  X  .  X  X  X
3   .  X  X  X  .  .  X  X  X  X  .  .  X  .  X  X
4   .  .  X  X  X  .  .  X  X  X  X  .  X  X  .  X
5   .  .  .  X  X  X  .  .  X  X  X  X  .  X  X  X

What about the remaining layers

• S4: Pooling layer as before
• C5: Convolutional layer connected to all previous feature maps.
• F6: fully-connected layer with 84 units
• Output: a specific layer

Bi-pyramidal structure: the number of feature maps increases while the spatial resolution
decreases.
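
The bi-pyramidal structure can be checked by tracing the spatial size through the layers. The sketch below relies on values taken from LeCun et al. 1998 that are assumptions with respect to these slides: a 32 x 32 input image, 5 x 5 kernels in C3 and C5, 16 feature maps in C3 (matching the 16 columns of the connection table) and 120 feature maps in C5.

```python
def out_size(I, K, S=1, P=0):
    """Output height/width of a convolution or pooling layer."""
    return (I - K + 2 * P) // S + 1

size = 32   # assumed input: 32 x 32 grayscale image
shapes = []
for name, K, S, maps in [("C1", 5, 1, 6), ("S2", 2, 2, 6),
                         ("C3", 5, 1, 16), ("S4", 2, 2, 16),
                         ("C5", 5, 1, 120)]:
    size = out_size(size, K, S)
    shapes.append((name, size, size, maps))   # (layer, height, width, number of maps)
```

The spatial size goes 28, 14, 10, 5, 1 while the number of feature maps goes 6, 6, 16, 16, 120.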

Output layer

Radial basis function units
The jth neuron of the output layer computes

$$\|z - w_j\|_2^2 = \sum_{i=1}^{84} (z_i - w_{j,i})^2,$$

where z is the vector of size 84 produced by layer F6 and w_j = (w_{j,1}, ..., w_{j,84}) is the
weight vector of the jth neuron.

Gaussian connections
Assuming that the vectors in layer F6 are Gaussian, neuron j outputs, up to an affine
transformation, the negative log-likelihood of a Gaussian distribution with mean w_j and
covariance matrix I.

In other words, each neuron outputs the squared Euclidean distance between its parameter
vector and the input.

Question.
How to choose w_j ∈ {-1, 1}^84?

Output layer and activation function

To choose w_0 ∈ {-1, 1}^84, use a stylized version of the image of 0 of size 7 x 12 = 84.
The pixels of this image are the parameters w_0 of the output neuron j = 0.

Why not use a one-hot encoding?

LeCun et al. 1998 state that it does not work with more than a few dozen classes, since
it requires output units to be off most of the time, which is difficult to achieve with
sigmoid functions.

Activation function

$$\sigma(x) = A \tanh(a x),$$
where A = 1.7159, a = 2/3.

→ Prevents saturation, since neuron outputs belong to {-1, 1}:
• σ(1) = 1
• σ(-1) = -1.
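
The two identities hold up to rounding of the constant A, which a two-line check makes visible (the function name `sigma` is illustrative):

```python
import math

A, a = 1.7159, 2.0 / 3.0

def sigma(x):
    """Scaled hyperbolic tangent sigma(x) = A * tanh(a * x)."""
    return A * math.tanh(a * x)

at_plus_one = sigma(1.0)     # close to 1
at_minus_one = sigma(-1.0)   # close to -1
```

With these constants the targets -1 and 1 are reached for inputs of -1 and 1, well inside the non-saturated part of the tanh.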

Criterion to optimize

Let [f_θ(x)]_j = ||z - w_j||_2^2 be the output of the jth neuron of the output layer, where z is
the vector produced by layer F6.
Then the error for one observation (x, y) is defined as

$$E(\theta) = \sum_{j=0}^{9} [f_\theta(x)]_j \, \mathbb{1}_{y=j} + \log\Big( e^{-C} + \sum_{j=0}^{9} e^{-[f_\theta(x)]_j} \Big),$$

where C > 0 is a constant.

The second term acts as a regularization since it forces the parameters of the neurons
j ≠ y to be far from the input vector of layer F6.

This is equivalent to

$$E(\theta) = -\log\left( \frac{e^{-[f_\theta(x)]_y}}{e^{-C} + \sum_{j=0}^{9} e^{-[f_\theta(x)]_j}} \right),$$

which is very close to the negative log likelihood of a softmax output layer.
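
The equivalence of the two expressions can be checked numerically. The sketch below assumes the criterion is the penalty of the correct class plus log(e^{-C} + Σ_j e^{-[f_θ(x)]_j}); the random weight vectors, the toy input z and the value C = 1 are illustrative choices only.

```python
import numpy as np

def rbf_outputs(z, W):
    """[f(x)]_j = ||z - w_j||^2 for each row w_j of W (shape 10 x 84)."""
    return np.sum((z - W) ** 2, axis=1)

def lenet_loss(f, y, C=1.0):
    """Penalty of the correct class plus the log-sum term."""
    return f[y] + np.log(np.exp(-C) + np.sum(np.exp(-f)))

def lenet_loss_ratio_form(f, y, C=1.0):
    """The same quantity written as a negative log-ratio (softmax-like form)."""
    return -np.log(np.exp(-f[y]) / (np.exp(-C) + np.sum(np.exp(-f))))

rng = np.random.default_rng(0)
W = rng.choice([-1.0, 1.0], size=(10, 84))   # stand-in for the stylized digit images
z = W[3] + 0.1 * rng.normal(size=84)         # an F6 output close to the code of class 3
f = rbf_outputs(z, W)
```

Taking the argmin of `f` gives the predicted class, since the output units measure distances rather than scores.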

Optimization procedure

Related to stochastic gradient descent:

$$\theta_j^{(t+1)} = \theta_j^{(t)} - \frac{\eta}{\mu + h_{jj}} \frac{\partial E_i}{\partial \theta_j},$$

where E_i is the loss of a single observation, η is the initial learning rate, μ a hand-picked
constant and h_jj is the jth diagonal element of the Hessian matrix associated with E_i.

The expression of h_jj is quite complicated since θ_j appears in different connections:

$$h_{jj} = \sum_{(i,m) \in V_j} \sum_{(k,l) \in V_j} \frac{\partial^2 E}{\partial u_{im}\, \partial u_{kl}},$$

where u_im is the connection between units i and m, and V_j is the set of pairs (i, m) such
that the connection between i and m involves the weight θ_j.

An approximation of each diagonal term h_jj is computed at the beginning of each epoch,
using the first 500 observations (the whole data set being composed of 60000 observations).

Parameters

Weight initialization: uniform distribution U([-2.4/F_i, 2.4/F_i]), where F_i is the number
of inputs (fan-in) of the unit to which the connection belongs.

→ Keeps the weighted sum in the same range for each unit.

Gradient descent

$$\theta_j^{(t+1)} = \theta_j^{(t)} - \frac{\eta}{\mu + h_{jj}} \frac{\partial E_i}{\partial \theta_j}$$

with μ = 0.02.

Optimization lasts 20 epochs:
• η = 0.0005 for the first two epochs,
• η = 0.0002 for the next three epochs,
• η = 0.0001 for the next three epochs,
• η = 0.00005 for the next four epochs,
• η = 0.00001 for the remaining epochs.
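
These settings translate into a few small helpers. This is a sketch, not the original training code: epochs are assumed to be numbered from 1, the update is assumed to scale η by 1/(μ + h_jj) for each parameter, and the function names are illustrative.

```python
import numpy as np

def learning_rate(epoch):
    """Learning rate eta for a 1-indexed epoch (20 epochs in total)."""
    if epoch <= 2:
        return 0.0005
    if epoch <= 5:
        return 0.0002
    if epoch <= 8:
        return 0.0001
    if epoch <= 12:
        return 0.00005
    return 0.00001

def init_weights(fan_in, n_units, rng):
    """Draw weights uniformly in [-2.4 / fan_in, 2.4 / fan_in]."""
    bound = 2.4 / fan_in
    return rng.uniform(-bound, bound, size=(n_units, fan_in))

def step(theta, grad, h_diag, eta, mu=0.02):
    """Per-parameter update: theta_j - eta / (mu + h_jj) * dE_i/dtheta_j."""
    return theta - eta / (mu + h_diag) * grad

weights = init_weights(fan_in=25, n_units=6, rng=np.random.default_rng(0))
```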

Results

The 82 patterns misclassified by LeNet5. Below each image is displayed the correct answer (left)
and the prediction (right). These errors are mostly caused by genuinely ambiguous patterns, or
by digits written in a style that is under-represented in the training set.

AlexNet

["Imagenet classification with deep convolutional neural networks", Krizhevsky et al. 2012]

Ingredients:
• Activation function (ReLU)
• Local Response Normalization (LRN)
• Overlapping pooling (3 x 3 window with a stride S = 2, which reduces overfitting)
• Dropout
• Data augmentation

ReLU activation function

According to Krizhevsky et al. 2012, convolutional neural networks with ReLU activation
functions can be trained several times faster than the same networks using the tanh function.

Figure (training error rate as a function of epochs): A four-layer convolutional neural network
with ReLU (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an
equivalent network with tanh (dashed line). The learning rates for each network were chosen
independently to make training as fast as possible.


E. Scornet Deep Learning 


Local Response Normalization / Brightness normalization

Let a^i_{x,y} be the activity of a neuron resulting from kernel i applied at position (x, y),
followed by a ReLU function, and b^i_{x,y} the corresponding renormalized activity, which is given by

$$b^i_{x,y} = a^i_{x,y} \Big/ \Big( C + \alpha \sum_{j=\max(0,\, i-q/2)}^{\min(Q-1,\, i+q/2)} \big(a^j_{x,y}\big)^2 \Big)^{\beta},$$

where the sum is taken over q adjacent feature maps at the same spatial position, and Q
is the total number of feature maps in this layer.

Constants (determined with a validation set): C = 2, q = 5, α = 10^{-4}, β = 0.75.

Note that the ordering of feature maps is arbitrary and determined before training. This
renormalization creates a competition between the different feature maps.

["What is the best multi-stage architecture for object recognition?", Jarrett et al. 2009]

They propose a similar normalization procedure where the mean activity is subtracted
(local contrast normalization).
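
A direct NumPy transcription of local response normalization is given below as a sketch. It assumes activations stored with the feature maps on the last axis and reads q/2 as the integer division q // 2, so that q = 5 gives a window of two maps on each side; the function name is illustrative.

```python
import numpy as np

def local_response_norm(a, C=2.0, q=5, alpha=1e-4, beta=0.75):
    """Normalize activations a of shape (H, W, Q) across neighboring feature maps."""
    Q = a.shape[-1]
    b = np.empty_like(a)
    for i in range(Q):
        lo, hi = max(0, i - q // 2), min(Q - 1, i + q // 2)
        # Sum of squared activities over the adjacent maps, at each spatial position
        denom = (C + alpha * np.sum(a[..., lo:hi + 1] ** 2, axis=-1)) ** beta
        b[..., i] = a[..., i] / denom
    return b

a = np.ones((2, 2, 8))
b = local_response_norm(a)
```

A map surrounded by strongly active neighbors is divided by a larger denominator, which is the competition between feature maps mentioned above.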

Overall architecture

[Figure: overall architecture, showing the convolutional layers, the max pooling steps and the
dense layers on two parallel paths]

Key point: the architecture is split across two GPUs, which, most of the time, do not
communicate with each other.

• Connectivity of each convolutional layer
• ReLU is applied right after all convolutional layers and fully connected layers
• Local Response Normalization is applied after ReLU in the first and second
  convolutional layers
• Max-pooling is applied after the first, second and fifth convolutional layers.

Optimization

Initialization:
• Weights: N(0, 0.0001) (zero-mean Gaussian with standard deviation 0.01)
• Biases of the second, fourth and fifth convolutional layers and biases of the fully connected
  layers are set to 1 (this seems to accelerate the early stages of learning and to prevent the
  dying ReLU phenomenon). Other biases are set to 0.

Stochastic gradient descent with momentum

$$v^{(t+1)} = 0.9\, v^{(t)} - 0.0005\, \eta\, \theta^{(t)} - \frac{\eta}{B} \sum_{i \in B_t} \frac{\partial E_i}{\partial \theta}\big(\theta^{(t)}\big)$$

$$\theta^{(t+1)} = \theta^{(t)} + v^{(t+1)}$$

with batch size |B_t| = B = 128.

The second term in the first equation corresponds to the L2 regularization of the loss with
a constant λ = 0.0005 (weight decay of 0.0005).
The learning rate is the same for all layers, with the following heuristic:

• Initialization: η = 0.01
• Divide η by 10 when the validation error stops improving (done three times here).
• 90 epochs on 1.2 million images: 6 days.
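
One step of this update rule can be sketched as follows, assuming `grad` already holds the loss gradient averaged over the mini-batch; the function name and the toy numbers are illustrative.

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, eta=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: v <- 0.9 v - 0.0005 eta theta - eta grad, then theta <- theta + v."""
    v_new = momentum * v - weight_decay * eta * theta - eta * grad
    return theta + v_new, v_new

theta, v = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, 0.5])          # stand-in for a mini-batch gradient
theta, v = sgd_momentum_step(theta, v, grad)
```

The weight-decay term pulls each weight toward zero in proportion to its current value, which is what the L2 penalty on the loss produces.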

Numerical results

Model             Top-1 (val)   Top-5 (val)   Top-5 (test)
SIFT + FVs [7]    -             -             26.2%
1 CNN             40.7%         18.2%         -
5 CNNs            38.1%         16.4%         16.4%
1 CNN*            39.0%         16.6%         -
7 CNNs*           36.7%         15.4%         15.3%

• The first line is the second-best entry of the competition.
• The second and third lines are the results obtained with 1 CNN, or by averaging the
  predictions of 5 CNNs, of the kind described before.
• The last two lines (marked *) correspond to networks with an extra convolutional layer after
  the last pooling layer, trained on ImageNet Fall 2011 and then "fine-tuned" on
  the ImageNet 2012 data base.

AlexNet has a very similar architecture to LeNet, but is deeper, bigger, and features
convolutional layers stacked directly on top of each other: previously, each convolutional
layer was immediately followed by a pooling layer.

Results


Figure 4: (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. 
The correct label is written under each image, and the probability assigned to the correct label is also shown 
with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The 
remaining columns show the six training images that produce feature vectors in the last hidden layer with the 
smallest Euclidean distance from the feature vector for the test image. 

ZFNet: Improve upon AlexNet

["Visualizing and understanding convolutional networks", Zeiler and Fergus 2014]

[Figure: network architecture. Input data (227 x 227 x 3), Conv1 (55 x 55 x 96), Conv2 (27 x 27 x 256),
Conv3 (13 x 13 x 384), Conv4 (13 x 13 x 384), Conv5 (13 x 13 x 256), FC6 (4096), FC7 (4096), FC8 (1000)]

The aim is to find out what the different feature maps are looking for, in order to obtain a
better tuning of the network architecture.

In ZFNet, feature maps are not divided across two different GPUs. Thus connections
between layers are less sparse than in AlexNet.


Deconvnet

Find the pixels that maximize the activation of a given feature map.

How? Invert the network.

Precisely:
• Choose a layer
• Choose a feature map
• Run the network on a

Results

Top 9 activations in a random subset of feature maps across the validation data,
projected down to pixel space using the previous deconvolutional network approach.

Remarks

• strong grouping within each feature map,
• greater invariance at higher layers,
• exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs
  (layer 4, row 1, cols 1).

Visualization of previous modifications

(b): 1st layer features from Krizhevsky et al. 2012.

(c): 1st layer features of ZFNet: smaller stride (2 vs 4) and filter size (7x7 vs 11x11)


Results in r 
and few 


(d): Visualizations of 2nd layer features from Krizhevsky et al. 2012; (e): Visualizations
of the 2nd layer features of ZFNet.

Feature maps in (e) are cleaner, with none of the aliasing artefacts that are visible in (d).

Conclusion regarding AlexNet

• First layer filters are a mix of high and low frequency information, with little
  coverage of middle frequencies.
  → Reduce the first layer filter size from 11 x 11 to 7 x 7.
• Aliasing artifacts are present in the second layer because of the large stride of 4 used in
  the first convolutional layer.
  → Change the stride from 4 to 2.

With these modifications:

• Winner of the ILSVRC 2013
• Improvement on AlexNet by
  ◦ expanding the size of the middle convolutional layers
  ◦ making the stride and filter size on the first layer smaller.

ZF Net final structure

[Figure: Input Image, Layer 1 to Layer 7, Output. Image size 224, filter size 7 in the first layer,
3x3 max pooling, two fully connected layers of 4096 units, C-class softmax output]

Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color
planes) is presented as the input. This is convolved with 96 different 1st layer filters (red),
each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then:
(i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions,
using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55
element feature maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are
fully connected, taking features from the top convolutional layer as input in vector form
(6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number
of classes. All filters and feature maps are square in shape.


Results in classification

Error %                                                       Val Top-1   Val Top-5   Test Top-5
(Gunji et al., 2012)                                          -           -           26.2
(Krizhevsky et al., 2012), 1 convnet                          40.7        18.2        -
(Krizhevsky et al., 2012), 5 convnets                         38.1        16.4        16.4
(Krizhevsky et al., 2012)*, 1 convnets                        39.0        16.6        -
(Krizhevsky et al., 2012)*, 7 convnets                        36.7        15.4        15.3
Our replication of (Krizhevsky et al., 2012), 1 convnet       40.5        18.1        -
1 convnet as per Fig. 3                                       38.4        16.5        -
5 convnets as per Fig. 3 - (a)                                36.7        15.3        15.3
1 convnet as per Fig. 3 but with
  layers 3, 4, 5: 512, 1024, 512 maps - (b)                   37.5        16.0        16.1
6 convnets, (a) & (b) combined                                36.0        14.7        14.8

Occlusion

[Figure: three test examples. Panels: (a) Input image; (b) Layer 5, strongest feature map;
(c) Layer 5, strongest feature map projections; (d) Classifier, probability of correct class;
(e) Classifier, most probable class]

Three test examples where we systematically cover up different portions of the scene with a gray square (1st column) and see how the top (layer 5) feature
maps ((b) & (c)) and classifier output ((d) & (e)) change.

(b): for each position of the gray square, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded
image).

(c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The
first row example shows the strongest feature to be the dog's face. When this is covered up, the activity in the feature map decreases (blue area in (b)).

(d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog's face is obscured, the probability for pomeranian
drops significantly.

(e): the most probable label as a function of occluder position.

Tiny VGGnet

["Very deep convolutional networks for large-scale image recognition", Simonyan and Zisserman 2014b]

[Figure: example network in which a RELU is applied after each convolution]

Network features

Convolutional layers:
• Small receptive field: 3 x 3 (the smallest size capable of capturing the notion of
  top/down, left/right)
• Stride of 1
• Spatial resolution is preserved after convolution

Max-pooling layers:
• 2 x 2 kernel
• Stride of 2

All hidden layers use ReLU activation functions.

Local Response Normalization layers do not improve performance.

Insightful remark...

If you stack 3 convolutional layers with 3 x 3 receptive fields, you obtain a convolutional
layer with a 7 x 7 receptive field. What is the benefit?

• Stack of 3 convolutional layers of size 3 x 3: complexity of 3 x (3 x 3) = 27.
• One standard convolutional layer of size 7 x 7: complexity of 49.

In the first case, we cannot obtain every possible 7 x 7 layer: the resulting object is a
decomposition into three consecutive convolutional layers. There are fewer possibilities, hence
fewer parameters.
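
Both numbers can be recomputed with two small helpers. The sketch counts weights per pair of input and output channels, ignores biases and assumes a stride of 1; the function names are illustrative.

```python
def receptive_field(n_layers, kernel=3):
    """Receptive field of n stacked stride-1 convolutions with the same kernel size."""
    return 1 + n_layers * (kernel - 1)

def weights_per_channel_pair(n_layers, kernel):
    """Weights per (input channel, output channel) pair, biases ignored."""
    return n_layers * kernel * kernel

stacked = weights_per_channel_pair(3, 3)   # three 3 x 3 layers
single = weights_per_channel_pair(1, 7)    # one 7 x 7 layer
```

Each extra 3 x 3 layer widens the receptive field by 2, so three layers reach 7 while using 27 weights instead of 49.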

VGGNet

ConvNet Configuration

A         | A-LRN     | B         | C         | D         | E
11 weight | 11 weight | 13 weight | 16 weight | 16 weight | 19 weight
layers    | layers    | layers    | layers    | layers    | layers

input (224 x 224 RGB image)
conv3-64  | conv3-64  | conv3-64  | conv3-64  | conv3-64  | conv3-64
          | LRN       | conv3-64  | conv3-64  | conv3-64  | conv3-64
maxpool
conv3-128 | conv3-128 | conv3-128 | conv3-128 | conv3-128 | conv3-128
          |           | conv3-128 | conv3-128 | conv3-128 | conv3-128
maxpool
conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256
conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256
          |           |           | conv1-256 | conv3-256 | conv3-256
          |           |           |           |           | conv3-256
maxpool
conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512
conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512
          |           |           | conv1-512 | conv3-512 | conv3-512
          |           |           |           |           | conv3-512
maxpool
conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512
conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512
          |           |           | conv1-512 | conv3-512 | conv3-512
          |           |           |           |           | conv3-512
maxpool
FC-4096
FC-4096
FC-1000
soft-max

Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases
from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The
convolutional layer parameters are denoted as "conv(receptive field size)-(number of channels)".
The ReLU activation function is not shown for brevity.
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Parameters 


Initialization: 
e Network A: N(0,0.01) for weights and 0 for biases. 


e For other networks: first four conv layers and last three fully connected layers were 
initialized using network A and the remaining layers were initialized randomly. 


Stochastic gradient descent with momentum 


(EM Oe GNE (9% 
vt) = 9.9v(9 — 9.000550 umor ) 
iE 


D = gO a „ED 
with batch size B = 128. 
Learning rate is the same for all layers with the following heuristic: 
• Initialization: η = 0.01
• Divide η by 10 when the validation error stops improving (done three times here).
• 74 epochs.

• L2 penalty with constant $5 \cdot 10^{-4}$

• Dropout regularization for the first two fully connected layers (probability p = 0.5)
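
As a minimal sketch of the update rule on this slide (momentum 0.9, L2 constant 0.0005, initial learning rate 0.01), one step can be written in plain Python. The function name `sgd_momentum_step` and the list-based parameters are illustrative assumptions, and `grad` is assumed to already be the mini-batch average of the per-example gradients.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD step with momentum and L2 penalty, coordinate by coordinate:
        v_new = momentum * v - weight_decay * lr * w - lr * grad
        w_new = w + v_new
    """
    v_new = [momentum * vi - weight_decay * lr * wi - lr * gi
             for wi, vi, gi in zip(w, v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


# Toy example with two weights and zero initial velocity.
w, v = [1.0, -2.0], [0.0, 0.0]
w, v = sgd_momentum_step(w, v, grad=[2.0, 0.0])
print(w, v)
```

Even where the gradient is zero (second coordinate), the L2 term still pulls the weight slightly towards 0.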
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Results 


Method | top-1 val. error (%) | top-5 val. error (%) | top-5 test error (%)
VGG (2 nets, multi-crop & dense eval.) | 23.7 | 6.8 | 6.8
VGG (1 net, multi-crop & dense eval.) | 24.4 | 7.1 | 7.0
VGG (ILSVRC submission, 7 nets, dense eval.) | 24.7 | 7.5 | 7.3
GoogLeNet (Szegedy et al., 2014) (1 net) | - | |
GoogLeNet (Szegedy et al., 2014) (7 nets) | - | |
MSRA (He et al., 2014) (11 nets) | - | - | 8.1
MSRA (He et al., 2014) (1 net) | 27.9 | 9.1 | 9.1
Clarifai (Russakovsky et al., 2014) (multiple nets) | - | - | 11.7
Clarifai (Russakovsky et al., 2014) (1 net) | - | - | 12.5
Zeiler & Fergus (Zeiler & Fergus, 2013) (6 nets) | 36.0 | 14.7 | 14.8
Zeiler & Fergus (Zeiler & Fergus, 2013) (1 net) | 37.5 | 16.0 | 16.1
OverFeat (Sermanet et al., 2014) (7 nets) | 34.0 | 13.2 | 13.6
OverFeat (Sermanet et al., 2014) (1 net) | 35.7 | 14.2 | -
Krizhevsky et al. (Krizhevsky et al., 2012) (5 nets) | 38.1 | 16.4 | 16.4
Krizhevsky et al. (Krizhevsky et al., 2012) (1 net) | 40.7 | 18.2 | -


A downside of VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters (140M).

Most of these parameters are in the first fully connected layer, and it has since been found that these FC layers can be removed with no loss of performance, significantly reducing the number of necessary parameters.
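
To make the parameter budget concrete, the sketch below counts weights and biases for configuration D (the 16-layer network). It assumes, as in the VGG paper, that each max pooling halves the spatial resolution (2 x 2 window, stride 2) and that 3 x 3 convolutions preserve it; the names `CONFIG_D` and `vgg_parameter_counts` are illustrative.

```python
# Configuration D: output channels of each 3x3 convolution, "M" = max pooling.
CONFIG_D = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
            512, 512, 512, "M", 512, 512, 512, "M"]


def vgg_parameter_counts(config, in_channels=3, input_size=224,
                         fc_sizes=(4096, 4096, 1000)):
    """Return (convolution parameters, list of parameters per FC layer)."""
    conv_params, channels, size = 0, in_channels, input_size
    for item in config:
        if item == "M":
            size //= 2  # assumed 2x2 max pooling with stride 2
        else:
            conv_params += 3 * 3 * channels * item + item  # weights + biases
            channels = item
    fc_params, fan_in = [], channels * size * size
    for width in fc_sizes:
        fc_params.append(fan_in * width + width)  # weights + biases
        fan_in = width
    return conv_params, fc_params


conv_params, fc_params = vgg_parameter_counts(CONFIG_D)
total = conv_params + sum(fc_params)
print(conv_params, fc_params, total)
```

Under these assumptions the first fully connected layer alone holds about 103M of the roughly 138M parameters, while all convolutions together hold fewer than 15M.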
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Outline 


• Famous CNN

  • GoogLeNet (2014)
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GoogLeNet 


["Going deeper with convolutions", Szegedy, W. Liu, et al. 2015]


Aim. 
Increasing the depth and width of state-of-the-art convolutional neural networks while 
keeping the number of parameters small: 


• Can approximate more complex functions

• while being robust to overfitting and computationally appealing.

How.

Use 1 x 1 convolution layers to reduce the number of channels, and hence of parameters, then apply in parallel filters of different sizes (3 x 3, 5 x 5) or a 3 x 3 max pooling (on each set of feature maps).

Details.
• All convolution layers use ReLU activation functions.

• Same spatial resolution for each feature map.
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GoogLeNet - Inception module 


Same spatial resolution for each feature map. 


Use 1 x 1 convolution layers to reduce the number of channels, and hence of parameters, then apply in parallel filters of different sizes (3 x 3, 5 x 5) or a 3 x 3 max pooling (on each set of feature maps).


[Figure panels: (a) Inception module, naive version; (b) Inception module with dimension reductions]

Figure 2: Inception module
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GoogLeNet - Inception module 


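type | patch size/stride | output size | depth | #1x1 | #3x3 reduce | #3x3 | #5x5 reduce | #5x5 | pool proj | params | ops
convolution | 7x7/2 | 112x112x64 | 1 | | | | | | | 2.7K | 34M
max pool | 3x3/2 | 56x56x64 | 0 | | | | | | | |
convolution | 3x3/1 | 56x56x192 | 2 | | 64 | 192 | | | | 112K | 360M
max pool | 3x3/2 | 28x28x192 | 0 | | | | | | | |
inception (3a) | | 28x28x256 | 2 | 64 | 96 | 128 | 16 | 32 | 32 | 159K | 128M
inception (3b) | | 28x28x480 | 2 | 128 | 128 | 192 | 32 | 96 | 64 | 380K | 304M
max pool | 3x3/2 | 14x14x480 | 0 | | | | | | | |
inception (4a) | | 14x14x512 | 2 | 192 | 96 | 208 | 16 | 48 | 64 | 364K | 73M
inception (4b) | | 14x14x512 | 2 | 160 | 112 | 224 | 24 | 64 | 64 | 437K | 88M
inception (4c) | | 14x14x512 | 2 | 128 | 128 | 256 | 24 | 64 | 64 | 463K | 100M
inception (4d) | | 14x14x528 | 2 | 112 | 144 | 288 | 32 | 64 | 64 | 580K | 119M
inception (4e) | | 14x14x832 | 2 | 256 | 160 | 320 | 32 | 128 | 128 | 840K | 170M
max pool | 3x3/2 | 7x7x832 | 0 | | | | | | | |
inception (5a) | | 7x7x832 | 2 | 256 | 160 | 320 | 32 | 128 | 128 | 1072K | 54M
inception (5b) | | 7x7x1024 | 2 | 384 | 192 | 384 | 48 | 128 | 128 | 1388K | 71M
avg pool | 7x7/1 | 1x1x1024 | 0 | | | | | | | |
dropout (40%) | | 1x1x1024 | 0 | | | | | | | |
linear | | 1x1x1000 | 1 | | | | | | | 1000K | 1M
softmax | | 1x1x1000 | 0 | | | | | | | |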


Table 1: GoogLeNet incarnation of the Inception architecture

"3x3 reduce" and "5x5 reduce" stand for the number of 1x1 filters in the reduction layer used before the 3x3 and 5x5 convolutions. The number of 1x1 filters in the projection layer after the built-in max-pooling is given in the pool proj column.
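
The effect of the 1 x 1 reduction layers can be checked with a short count for inception (3a), whose input has 192 channels (output of the preceding max pool). This is a rough sketch that counts weights only (biases ignored); the helper names are illustrative, and the "naive" module is assumed to apply the same 64 / 128 / 32 filters directly to the 192 input channels.

```python
def conv_weights(k, c_in, c_out):
    """Number of weights of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def naive_inception(c_in, n1, n3, n5):
    # 1x1, 3x3 and 5x5 branches applied directly to the input;
    # the max pooling branch has no parameters.
    return (conv_weights(1, c_in, n1)
            + conv_weights(3, c_in, n3)
            + conv_weights(5, c_in, n5))


def reduced_inception(c_in, n1, r3, n3, r5, n5, pool_proj):
    # 1x1 reductions before the 3x3 and 5x5 branches,
    # 1x1 projection after the max pooling branch.
    return (conv_weights(1, c_in, n1)
            + conv_weights(1, c_in, r3) + conv_weights(3, r3, n3)
            + conv_weights(1, c_in, r5) + conv_weights(5, r5, n5)
            + conv_weights(1, c_in, pool_proj))


# inception (3a): #1x1=64, #3x3 reduce=96, #3x3=128, #5x5 reduce=16, #5x5=32, pool proj=32
naive = naive_inception(192, 64, 128, 32)
reduced = reduced_inception(192, 64, 96, 128, 16, 32, 32)
out_channels = 64 + 128 + 32 + 32
print(naive, reduced, out_channels)
```

The reduced module has 163,328 weights, in line with the 159K entry of the table if K is read as 1024, against 387,072 for the naive variant; its four branches concatenate to the 256 output channels listed for (3a).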
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Structure of GoogLeNet 


[Figure: structure of GoogLeNet]
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Deep network - A concern 


To help the gradient propagate back through this deep network, the authors add auxiliary classifiers connected to intermediate layers.

During training, the loss of each auxiliary classifier is weighted by 0.3 and added to the total loss of the network. The auxiliary networks are removed at inference time.

Auxiliary networks are placed after (4a) and (4d):
• An average pooling layer, 5 x 5, with a stride of 3
• A 1 x 1 convolution with 128 filters, with ReLU
• A fully connected layer with 1024 neurons and ReLU
• A dropout layer with a dropout ratio of 70%
• A linear layer with softmax loss, predicting the same 1000 classes as the main classifier
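
The weighting of the auxiliary losses can be sketched as follows. This is only an illustration of the "main loss + 0.3 x auxiliary losses" rule stated on this slide; the function names and the use of raw logits as inputs are assumptions.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    m = max(logits)
    log_partition = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_partition - logits[label]


def training_loss(main_logits, aux_logits_list, label, aux_weight=0.3):
    """Main classifier loss plus the weighted losses of the auxiliary classifiers."""
    main = cross_entropy(main_logits, label)
    aux = sum(cross_entropy(aux_logits, label) for aux_logits in aux_logits_list)
    return main + aux_weight * aux


def inference_loss(main_logits, label):
    """At inference time the auxiliary classifiers are dropped."""
    return cross_entropy(main_logits, label)


# Toy example with 2 classes and two auxiliary heads, as after (4a) and (4d).
loss = training_loss([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], label=1)
print(loss)
```

With uninformative logits every classifier has loss log 2, so the training loss is (1 + 2 x 0.3) log 2.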
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Parameters 


Initialization: 
• Weights are drawn from (0, 1) and biases are set to 0.

["Deep learning via Hessian-free optimization", Martens 2010]


Stochastic gradient descent with momentum 


with batch size B = 200, where the momentum coefficient is capped:
$$\mu_t = \min(1 - \ldots,\ \mu_{\max}),$$
where $\mu_{\max} \in \{0, 0.9, 0.99, 0.995, 0.999\}$.


Learning rate is the same for all layers with the following heuristic: 
• Initialization: η = 0.01
• Multiply η by 0.96 every 8 epochs.
• Training lasts 125 epochs.
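
The learning rate heuristic above is a simple exponential step schedule. A minimal sketch, assuming epochs are counted from 0 and the first decay happens at epoch 8 (the function name is illustrative):

```python
def googlenet_learning_rate(epoch, base_lr=0.01, factor=0.96, every=8):
    """Learning rate used during `epoch` (0-based): multiplied by `factor` every `every` epochs."""
    return base_lr * factor ** (epoch // every)


schedule = [googlenet_learning_rate(e) for e in range(125)]
print(schedule[0], schedule[8], schedule[-1])
```

Over 125 epochs the rate is decayed 15 times, so it only shrinks by a factor of about 0.54.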
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Results 
• Polyak averaging is used to create the final model at inference time.

• 7 different versions of GoogLeNet were trained and aggregated to make predictions.


Team | Year | Place | Error (top-5) | Uses external data
SuperVision | 2012 | 1st | 16.4% | no
SuperVision | 2012 | 1st | 15.3% | Imagenet 22k
Clarifai | 2013 | 1st | 11.7% | no
Clarifai | 2013 | 1st | 11.2% | Imagenet 22k
MSRA | 2014 | 3rd | 7.35% | no
VGG | 2014 | 2nd | 7.32% | no
GoogLeNet | 2014 | 1st | 6.67% | no

Table 2: Classification performance


Number of models | Number of crops | Cost | Top-5 error | Compared to base
1 | 1 | 1 | 10.07% | base
1 | 10 | 10 | 9.15% | -0.92%
1 | 144 | 144 | 7.89% | -2.18%
7 | 1 | 7 | 8.09% | -1.98%
7 | 10 | 70 | 7.62% | -2.45%
7 | 144 | 1008 | 6.67% | -3.45%


Main contribution: development of an Inception Module that dramatically reduced the number of 
parameters in the network (4M, compared to AlexNet with 60M). 
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Outline 


• Famous CNN

  • ResNet (2016)
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ResNet (2016) 


["Deep residual learning for image recognition", He et al. 2016]
Statement: Optimization can be hard for some deep networks.

Solution: Ease optimization by adding simple paths (identity shortcuts) to the network

[Figure 2. Residual learning: a building block (weight layers with an identity shortcut carrying x).]

→ No extra parameters, no additional computational complexity
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Literature on shortcut connections 


An early practice for training multi-layer perceptrons was to add a linear layer connecting the inputs to the outputs

[Pattern recognition and neural networks, Ripley 2007]

A few intermediate classifiers can also be attached to intermediate layers in order to ease the optimization:

• ["Going deeper with convolutions", Szegedy, W. Liu, et al. 2015]
• ["Deeply-supervised nets", Lee et al. 2015]

Highway networks have shortcut connections with gating functions. Here, the gates are data-dependent and have parameters.

• ["Highway networks", Rupesh Kumar Srivastava et al. 2015]
• ["Training very deep networks", Rupesh K Srivastava et al. 2015]
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General Idea 


Inspired from VGG nets: 


• For the same output feature map size, the layers have the same number of filters

• If the feature map size is halved, the number of filters is doubled to preserve the time complexity per layer

$$y = F(x, W_i) + x,$$

where x and y are respectively the input and the output of a (stack of) layer(s), $W_i$ are the weights of this/these layer(s) and $F(x, W_i)$ is the output of this/these layer(s).

If dimensions do not match between x and $F(x, W_i)$, there are two solutions:
• The identity mapping is kept, with extra zero entries padded for the increased dimensions
• A projection shortcut is used to match dimensions via 1 x 1 convolution filters:
$$y = F(x, W_i) + W_s x,$$
where $W_s$ is a projection.

Besides, “when the shortcuts go across feature maps of two different sizes, they are performed with a stride of 2”.
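
The three shortcut variants can be illustrated on plain vectors. This is a toy sketch, not the convolutional implementation: the residual branch F is assumed to be two linear maps with a ReLU in between, the final ReLU applied after the addition in the actual ResNet block is omitted, and all function names are illustrative.

```python
def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def residual_branch(x, W1, W2):
    # F(x, W_i): toy stand-in for a stack of two weight layers.
    return matvec(W2, relu(matvec(W1, x)))


def identity_block(x, W1, W2):
    # y = F(x, W_i) + x, valid when F(x, W_i) and x have the same dimension.
    return [f + a for f, a in zip(residual_branch(x, W1, W2), x)]


def zero_padded_block(x, W1, W2):
    # Identity shortcut padded with zeros for the extra dimensions (no parameters).
    fx = residual_branch(x, W1, W2)
    shortcut = list(x) + [0.0] * (len(fx) - len(x))
    return [f + s for f, s in zip(fx, shortcut)]


def projection_block(x, W1, W2, Ws):
    # y = F(x, W_i) + W_s x, where W_s maps x to the dimension of F(x, W_i).
    return [f + s for f, s in zip(residual_branch(x, W1, W2), matvec(Ws, x))]


x = [1.0, -2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
y = identity_block(x, W1, W2)
print(y)
```

When the residual branch outputs zero, each block reduces to its shortcut, which is the sense in which these paths are easy to optimize.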
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Structure of ResNet 


[Figure: network architectures compared side by side: VGG-19, 34-layer plain network, 34-layer residual network]


Deep Learning 


Parameters 


Initialization, as in He et al. 2015: weights are drawn from $\mathcal{N}(0, 2/n_\ell)$ ($n_\ell$ is the number of neurons in the previous layer); biases are set to 0.
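
A minimal sketch of this initialization for a fully connected layer, using only the standard library. The function name `he_normal` and the choice of a dense layer (rather than a convolution) are assumptions made for illustration; the second argument of the normal distribution is read as a variance, so the standard deviation is sqrt(2 / n).

```python
import math
import random


def he_normal(n_in, n_out, rng):
    """Weights ~ N(0, 2 / n_in) (variance), biases = 0, for a layer with n_in inputs."""
    std = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)  # fixed seed so the example is reproducible
weights, biases = he_normal(n_in=50, n_out=400, rng=rng)

flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(mean, variance)  # empirical variance should be close to 2 / 50 = 0.04
```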


Stochastic gradient descent with momentum 


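$$v^{(t+1)} = 0.9\, v^{(t)} - 0.0001\, \eta\, w^{(t)} - \frac{\eta}{B} \sum_{i \in \mathcal{B}} \nabla_w \ell_i(w^{(t)})$$
$$w^{(t+1)} = w^{(t)} + v^{(t+1)}$$
with batch size $B = |\mathcal{B}| = 256$.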


Learning rate is the same for all layers with the following heuristic: 
• Initialization: η = 0.1
• Divide η by 10 when the validation error stops improving (done three times here).
• 120 epochs.
Miscellaneous:
• Batch normalization after each convolution and before activation

• No dropout
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Results 


method | top-5 err. (test)
VGG [41] (ILSVRC'14) | 7.32
GoogLeNet [44] (ILSVRC'14) | 6.66
VGG [41] (v5) | 6.8
PReLU-net [13] | 4.94
BN-inception [16] | 4.82
ResNet (ILSVRC'15) | 3.57


• Winner of ILSVRC 2015
• Special skip connections and heavy use of batch normalization

• No large fully connected layers at the end of the network: global average pooling is followed by a single 1000-way classification layer.
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Outline 


• Famous CNN

  • DenseNet (2017)
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DenseNet 


["Densely Connected Convolutional Networks", G. Huang et al. 2017]

Figure: A deep DenseNet with three dense blocks. The layers between two adjacent blocks are referred to as transition layers and change feature-map sizes via convolution and pooling.
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DenseNet 


Figure: A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding 
feature-maps as input. 
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Ingredients 


Let $x_\ell$ be the output of the $\ell$th layer. Usually,
$$x_\ell = f_\ell(x_{\ell-1}).$$

Dense Block. Inside a dense block,
$$x_\ell = f_\ell(x_0, x_1, \ldots, x_{\ell-1}).$$
The functions $f_\ell$ are composed of three consecutive operations:

• First, a batch normalization
• Then, a ReLU activation function
• Finally, a 3 x 3 convolutional layer (feature map sizes are kept fixed)

Transition layers.
• Batch normalization
• 1 x 1 convolution
• 2 x 2 average pooling
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Ingredients 


Growth rate k 

If each function $f_\ell$ produces k feature maps, the input of the $\ell$th layer has $k_0 + k(\ell - 1)$ channels, where $k_0$ is the number of channels of the block input. Narrow layers (typically k = 12) give good results.

→ Indeed, each layer has access to every previous layer and thus to the "collective knowledge" of the network.
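
The channel bookkeeping of a dense block can be checked with a small sketch. Feature maps are represented here by plain numbers, and `toy_layer` is a hypothetical stand-in for the BN - ReLU - Conv composition that simply returns k new "feature maps"; only the concatenation pattern is meant to be faithful.

```python
def dense_block_input_channels(k0, k, num_layers):
    """Channels seen by layers 1..num_layers: k0 + k * (l - 1)."""
    return [k0 + k * (layer - 1) for layer in range(1, num_layers + 1)]


def dense_block_forward(x0, layers):
    """Each layer receives the concatenation of the block input and all previous outputs."""
    features = list(x0)
    for f in layers:
        features = features + f(features)  # concatenate along the channel axis
    return features


seen = []  # number of input channels observed by each layer


def toy_layer(feats, k=4):
    # Hypothetical layer: records its input width and emits k new feature maps.
    seen.append(len(feats))
    return [sum(feats) / len(feats)] * k


# 5-layer block with growth rate k = 4 and k0 = 2 input channels.
out = dense_block_forward([1.0, 2.0], [toy_layer] * 5)
print(seen, len(out))
```

The block output has k0 + 5k = 22 channels, which is what the following transition layer receives.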


Bottleneck layer - DenseNet-B 
A way to improve computational efficiency is to introduce 1 x 1 convolutional layers: inside a dense block, each layer becomes

BN - ReLU - Conv (1 x 1) - BN - ReLU - Conv (3 x 3)
The 1 x 1 convolutions are set to produce 4k feature maps.

Compression layer - DenseNet-C
Keep only a fraction θ ∈ (0, 1] of the feature maps at transition layers (typically θ = 0.5, so half of them are thrown away).
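
A rough weight count shows when the bottleneck pays off, and what compression does to the channel count. Biases and batch normalization parameters are ignored, the function names are illustrative, and the rounding ⌊θm⌋ of the number of kept feature maps follows the DenseNet paper rather than this slide.

```python
import math


def plain_layer_weights(m, k):
    """3x3 convolution applied directly to m input channels, producing k feature maps."""
    return 3 * 3 * m * k


def bottleneck_layer_weights(m, k):
    """1x1 convolution to 4k feature maps, followed by a 3x3 convolution to k feature maps."""
    return 1 * 1 * m * (4 * k) + 3 * 3 * (4 * k) * k


def compressed_channels(m, theta=0.5):
    """Number of feature maps kept by a transition layer with compression factor theta."""
    return math.floor(theta * m)


k = 12
print(plain_layer_weights(256, k), bottleneck_layer_weights(256, k))  # deep in a block
print(plain_layer_weights(24, k), bottleneck_layer_weights(24, k))    # early in a block
print(compressed_channels(256))
```

The bottleneck only helps once the number of accumulated input channels m exceeds 4k; for the first layers of a block it is actually more expensive.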
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Architecture 


Layers | Output Size | DenseNet-121 | DenseNet-169 | DenseNet-201 | DenseNet-264
Convolution | 112 x 112 | 7 x 7 conv, stride 2 (all networks)
Pooling | 56 x 56 | 3 x 3 max pool, stride 2 (all networks)
Dense Block (1) | 56 x 56 | [1 x 1 conv, 3 x 3 conv] x 6 | [1 x 1 conv, 3 x 3 conv] x 6 | [1 x 1 conv, 3 x 3 conv] x 6 | [1 x 1 conv, 3 x 3 conv] x 6
Transition Layer (1) | 56 x 56 | 1 x 1 conv (all networks)
 | 28 x 28 | 2 x 2 average pool, stride 2 (all networks)
Dense Block (2) | 28 x 28 | [1 x 1 conv, 3 x 3 conv] x 12 | [1 x 1 conv, 3 x 3 conv] x 12 | [1 x 1 conv, 3 x 3 conv] x 12 | [1 x 1 conv, 3 x 3 conv] x 12
Transition Layer (2) | 28 x 28 | 1 x 1 conv (all networks)
 | 14 x 14 | 2 x 2 average pool, stride 2 (all networks)
Dense Block (3) | 14 x 14 | [1 x 1 conv, 3 x 3 conv] x 24 | [1 x 1 conv, 3 x 3 conv] x 32 | [1 x 1 conv, 3 x 3 conv] x 48 | [1 x 1 conv, 3 x 3 conv] x 64
Transition Layer (3) | 14 x 14 | 1 x 1 conv (all networks)
 | 7 x 7 | 2 x 2 average pool, stride 2 (all networks)
Dense Block (4) | 7 x 7 | [1 x 1 conv, 3 x 3 conv] x 16 | [1 x 1 conv, 3 x 3 conv] x 32 | [1 x 1 conv, 3 x 3 conv] x 32 | [1 x 1 conv, 3 x 3 conv] x 48
Classification Layer | 1 x 1 | 7 x 7 global average pool (all networks)
 | | 1000D fully-connected, softmax (all networks)
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Parameters


Initialization, as in He et al. 2015: weights are drawn from $\mathcal{N}(0, 2/n_\ell)$ ($n_\ell$ is the number of neurons in the previous layer); biases are set to 0.


Stochastic gradient descent with momentum 


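$$v^{(t+1)} = 0.9\, v^{(t)} - 0.0001\, \eta\, w^{(t)} - \frac{\eta}{B} \sum_{i \in \mathcal{B}} \nabla_w \ell_i(w^{(t)})$$
$$w^{(t+1)} = w^{(t)} + v^{(t+1)}$$
with batch size $B = |\mathcal{B}| = 256$.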


Learning rate is the same for all layers with the following heuristic: 
• Initialization: η = 0.1
• Divide η by 10 at epochs 30 and 60.
• 90 epochs.
Miscellaneous:
• Batch normalization after each convolution and before activation

• No dropout
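
The step schedule above (initial rate 0.1, divided by 10 at epochs 30 and 60, 90 epochs in total) can be sketched as follows; the function name and the 0-based epoch convention are assumptions.

```python
def step_learning_rate(epoch, base_lr=0.1, milestones=(30, 60), factor=0.1):
    """Learning rate used during `epoch` (0-based): multiplied by `factor` at each milestone."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed


schedule = [step_learning_rate(e) for e in range(90)]
print(schedule[0], schedule[30], schedule[60])
```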
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DenseNet Results 


Method | Depth | Params | C10 | C10+ | C100 | C100+ | SVHN
Network in Network [22] | - | - | 10.41 | 8.81 | 35.68 | - | 2.35
All-CNN [32] | - | - | 9.08 | 7.25 | - | 33.71 | -
Deeply Supervised Net [20] | - | - | 9.69 | 7.97 | - | 34.57 | 1.92
Highway Network [34] | - | - | - | 7.72 | - | 32.39 | -
FractalNet [17] | 21 | 38.6M | 10.18 | 5.22 | 35.34 | 23.30 | 2.01
with Dropout/Drop-path | 21 | 38.6M | 7.33 | 4.60 | 28.20 | 23.73 | 1.87
ResNet [11] | 110 | 1.7M | - | 6.61 | - | - | -
ResNet (reported by [13]) | 110 | 1.7M | 13.63 | 6.41 | 44.74 | 27.22 | 2.01
ResNet with Stochastic Depth [13] | 110 | 1.7M | 11.66 | 5.23 | 37.80 | 24.58 | 1.75
 | 1202 | 10.2M | - | 4.91 | - | - | -
Wide ResNet [42] | 16 | 11.0M | - | 4.81 | - | 22.07 | -
 | 28 | 36.5M | - | 4.17 | - | 20.50 | -
with Dropout | 16 | 2.7M | - | - | - | - | 1.64
ResNet (pre-activation) [12] | 164 | 1.7M | 11.26* | 5.46 | 35.58* | 24.33 | -
 | 1001 | 10.2M | 10.56* | 4.62 | 33.47* | 22.71 | -
DenseNet (k = 12) | 40 | 1.0M | 7.00 | 5.24 | 27.55 | 24.42 | 1.79
DenseNet (k = 12) | 100 | 7.0M | 5.77 | 4.10 | 23.79 | 20.20 | 1.67
DenseNet (k = 24) | 100 | 27.2M | 5.83 | 3.74 | 23.42 | 19.25 | 1.59
DenseNet-BC (k = 12) | 100 | 0.8M | 5.92 | 4.51 | 24.15 | 22.27 | 1.76
DenseNet-BC (k = 24) | 250 | 15.3M | 5.19 | 3.62 | 19.64 | 17.60 | 1.74
DenseNet-BC (k = 40) | 190 | 25.6M | - | 3.46 | - | 17.18 | -

Table 2: Error rates (%) on CIFAR and SVHN datasets. k denotes the network's growth rate. Results that surpass all competing methods are bold and the overall best results are blue. “+” indicates standard data augmentation (translation and/or mirroring). * indicates results run by ourselves. All the results of DenseNets without data augmentation (C10, C100, SVHN) are obtained using Dropout. DenseNets achieve lower error rates while using fewer parameters than ResNet. Without data augmentation, DenseNet performs better by a large margin.
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Outline 


• Famous CNN

  • Many other CNN
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Inception V2-V3 


Based on GoogLeNet Inception module 
["Rethinking the inception architecture for computer vision", Szegedy, Vanhoucke, et al. 2016]


[Figure: Inception module variants, each ending with a filter concatenation]

New ideas:

• Using asymmetric convolutions 1 x n and n x 1 (for n = 3, 5, 7) can be useful in the middle layers of the network, for feature maps of size m x m (for 12 < m < 20).

• Label smoothing using a uniform distribution over labels
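
Label smoothing with a uniform distribution replaces the one-hot target by a mixture of the one-hot vector and the uniform distribution over the K labels. A minimal sketch; the function name is illustrative and the smoothing weight `epsilon=0.1` is an assumption not stated on this slide.

```python
def smooth_labels(label, num_classes, epsilon=0.1):
    """Target q'(k) = (1 - epsilon) * 1[k == label] + epsilon / num_classes."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) * (1.0 if k == label else 0.0) + uniform
            for k in range(num_classes)]


target = smooth_labels(label=2, num_classes=4)
print(target)
```

The smoothed target still sums to 1, but no class receives probability exactly 0 or 1, which discourages the network from producing overconfident logits.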


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Xception 


["Xception: Deep learning with depthwise separable convolutions", Chollet 2017]
Stands for “Extreme Inception” and builds upon the Inception module of GoogLeNet.

[Figure: Xception module, from input to output channels]

The main ideas:
• Perform 1 x 1 convolutions

• Apply 3 x 3 (or other filter size) convolutions separately to each feature map created by the 1 x 1 convolutions.

→ This decouples the depth transformation (1 x 1 convolutions) from the spatial transformations (convolutions on each feature map separately).
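
The saving obtained by decoupling the two transformations can be estimated with a weight count (biases ignored). This sketch follows the order described on this slide, 1 x 1 convolutions first and then one spatial filter per resulting feature map; the function names and the channel sizes in the example are illustrative.

```python
def standard_conv_weights(k, c_in, c_out):
    """Regular k x k convolution: every output map looks at every input map."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """1x1 convolution mixing channels, then one k x k filter per resulting feature map."""
    pointwise = c_in * c_out
    depthwise = k * k * c_out
    return pointwise + depthwise


standard = standard_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)
print(standard, separable, standard / separable)
```

For 128 input and 256 output channels, the separable version needs roughly 8 times fewer weights than a regular 3 x 3 convolution.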
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Comparison of several CNN 
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["An analysis of deep neural network models for practical applications", Canziani et al. 2016]
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CNN Taxonomy 


[Figure: taxonomy of CNN architectures, with families such as depth based, multi-path based and attention based CNNs, and entries including Highway Nets, WideResNet, Pyramidal Net, Squeeze and Excitation, Competitive Squeeze and Excitation, Concurrent Squeeze and Excitation, Channel Boosted CNN using TL, Residual Attention Neural Network and Convolutional Block Attention]

See this very detailed review paper: ["A survey of the recent architectures of deep convolutional neural networks", Khan et al. 2020]
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